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Abstract 

As we aim at alleviating the curse of high-dimensionality, subspace learning is becoming more popular. 
Existing approaches use either information about global or local structure of the data, and few studies 
simultaneously focus on global and local structures as the both of them contain important information. In 
this paper, we propose a global and local structure preserving sparse subspace learning (GLoSS) model 
for unsupervised feature selection. The model can simultaneously realize feature selection and subspace 
learning. In addition, we develop a greedy algorithm to establish a generic combinatorial model, and an 
iterative strategy based on an accelerated block coordinate descent is used to solve the GLoSS problem. We 
also provide whole iterate sequence convergence analysis of the proposed iterative algorithm. Extensive 
experiments are conducted on real-world datasets to show the superiority of the proposed approach over 
several state-of-the-art unsupervised feature selection approaches. 
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1. Introduction 

With the advances in data processing, the dimensionality of the data increases and can be extremely high 
in many fields such as computer vision, machine learning and image processing. The high dimensionality 
of the data not only greatly increases the time and storage space required to realize data analysis but also 
introduces much redundancy and noise which can decrease the accuracy of ensuing methods. Hence, di¬ 
mensionality reduction becomes an important and often necessary preprocessing step to accomplish certain 
machine learning tasks such as clustering and classification. 
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Generally speaking, dimensionality reduction approaches can be divided into two classes: feature selec¬ 
tion and subspace learning. Feature selection methods aim to select a subset of most representative features 
following a certain criterion (e.g., Em 13 a a) , while subspace learning methods aim to learn a (linear 
or nonlinear) transformation to map the original high-dimensional data into a lower-dimensional subspace 
(e.g., 0131119]). Subspace learning methods, such as principal component analysis (PCA), combine all 
original features at each dimension of the learned subspace, and this causes some interpretation difficul¬ 
ties. To overcome this difficulty, sparse subspace learning methods (e.g., IfTOlfTTlfT^ ) and joint models that 
simultaneously perform subspace learning and feature selection (e.g., III3IIl[l3[l6l) have been developed. 

This paper exhibits the following main contributions: 

1. We propose a novel unsupervised sparse subspace learning model for feature selection. The model 
simultaneously perserves global and local structures of the data, both of which contain important 
discriminative information for feature selection, as demonstrated in iniiis]. We derive the model by 
first relaxing an existing combinatorial model and then adding a group sparsity regularization term. 
The regularization term controls the row sparsity of the transformation matrix, and since each row 
of the transformation matrix corresponds to a feature, the proposed model can automatically select 
representative features and makes easy interpretation. 

2. We, for the first time, propose a greedy algorithm to the original combinatorial optimization problem. 
In addition, we apply the accelerated block coordinate descent (BCD) method proposed in ifT^ to the 
relaxed continuous but nonconvex problem. The BCD method utilizes the bi-convexity structure of 
the problem and has been found very efficient for our purposes. 

3. We establish a whole iterate sequence convergence result of the BCD method for our problem under 
consideration by assuming the existence of a full rank limit point. Because of the peculiarity of the 
formulated problem, the result is new and not implied by any existing convergence results of BCD. 

4. We conduct extensive experimental studies. The proposed method is tested on six real-world datasets 
coming from different areas and compared to eight state-of-the-art unsupervised feature selection al¬ 
gorithms. The results demonstrate the superiority of the proposed method over all the other compared 
methods. In addition, we study the sensitivity of the proposed method to the parameters of the model 
and observe that it can perform in a stable way within a large range of values of the parameters. 

Organization and notation 

The paper is organized as follows. In Sect. we give a brief review of recent related studies on subspace 
learning. Sect, [previews two local structure preserving methods and proposes a local structure preserving 
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sparse subspace learning model. In Sect. we present an algorithm leading to the solution of the proposed 
model. Convergence results are also shown. Experimental results are reported in Sect. Finally, Sec. 
concludes this paper. 

To facilitate the presentation of the material, we list a notation in Table 


Table 1: Notation 


Notation 

Description 

n 

The number of instances 

d 

The number of features 

K 

The number of selected features 

A, 

The C'* row of the matrix A 

K 

The dimension of subspace 

m 

The number of nearest neighbors 

II^U||2,l 

Tii the sum of the ^ 2 -norm of rows in W 

llxllo 

0), the number of nonzero elements in vector x 

\I\ 

cardinality of set I 


2. Related Studies 

Subspace learning 

One well-known subspace learning method is principal component analysis (PCA) fTl l20ll . It maxi¬ 
mizes the global data structure information in the principal space and thus it becomes optimal in terms 
of data fitting. Beside global structure, local structure of the data also contains important discriminative 
information ll2n . which plays a crucial role in pattern recognition ll22ll . Many subspace learning methods 
preserve different local structures of the data for different problems and can yield better performance than 
the traditional PCA method. These methods usually use the linear extension of graph embedding (LGE) to 
preserve local structure. With different choices of the graph adjoint matrix, LGE framework leads to dif¬ 
ferent subspace learning methods. The popular ones include Linear Discriminant Analysis (LDA) ifTI I^ . 
Locality Preserving Projection (LPP) ll24l l25l l26l and Neighborhood Preserving Embedding (NPE) [D. 
One drawback of these locality preservation methods is that they require eigen-decomposition of dense 
matrices, which can be very expensive in both CPU time and machine storage, especially for problems 
involving high-dimensional data. To overcome this drawback, Cai et al. proposed a Spectral Regres¬ 
sion (SR) method to transform the eigen-decomposition problem into a two-step regression problem that 
becomes easier to solve. 
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Sparse subspace learning 

Although the subspace learning can transform the original high-dimensional data into a lower-dimensional 
space, it mingles all features and lacks interpretability. For better interpretability, sparse subspace learning 
methods have been proposed in the literature by adding certain sparsity regularization terms or sparsity con¬ 
straints into subspace learning models. For example, the sparse PCA (SPCA) QO) adds “Elastic Net” term 
into the traditional PCA. Moghaddam et al. 1271 proposed a spectral bounds framework for sparse subspace 
learning. Cai et al. lfT2l proposed a unified sparse subspace learning method based on spectral regression 
model, which adds an regularization term in the regression step. Qiao et al. 12^ introduced the Sparsity 
Preserving Projection (SPP) method for subspace learning, while SPP utilizes the sparsity coefficients to 
construct the graph Laplacian. It is worth mentioning that besides subspace learning, sparsity regularized 
methods have also been used in many other fields such as computer vision ||29ll30l, image processing ED, 
and signal recovery ll32l . 

Simultaneous feature selection and subspace learning 

Recently, joint methods have been proposed to simultaneously perform feature selection and subspace 
learning. The core idea of these methods is to use the transformation matrix to guide feature selection 
according to the norm of its row/column vectors. Cai et al. Ea combined the sparse subspace learning with 
feature selection and proposed the Multi-Cluster Feature Selection (MCFS) method. Because MCFS uses 
fi-term to control the sparsity of the transformation matrix, different dimensions of the learned subspace 
may combine different features, and thus the model lacks sound interpretability. Gu et al. m improved 
the MCFS method by using f’ 2 ,i-term to enforce the row sparsity of the transformation matrix. This way, the 
transformation matrix will have zero-rows corresponding to irrelavant features. Wang et al. m proposed 
an unsupervised feature selection framework, which uses the global regression term for subspace learning 
and orthogonal transformation matrix for feature selection. In general, the orthogonality constraint may 
limit its applications, as mentioned in ll33l in practice, feature weight vectors are not necessarily orthogonal 
to each other. In addition, the model discussed in m does not utilize local structure of the data. As 
demonstrated in lITTl . local structure of the data often contains essential discriminative information. 

Other related works 

There are some other related methods for subspace learning. Provided with only weak label infor¬ 
mation (e.g., preference relationships between examples), Xu et al. Il3^ proposes a Weakly Supervised 
Dimensionality Reduction (WSDR) method, which considers samples’ pairwise angles and also distances. 
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For the /T-means problem, Boutsidis et al. llBSl proposed randomized feature selection and subspace learn¬ 
ing methods and showed that a constant-factor approximation can be guaranteed with respect to the optimal 
/T-means objective value. Other popular subspace learning methods include: Nonnegative Matrix Factoriza¬ 
tion (NMF) ll^im that considers subspace learning of nonnegative data; joint LDA and /T-means 1781 that 
combines LDA and /f-means clustering together for unsupervised subspace learning; Dictionary Learning 
(DL) |[39l that first learns a dictionary via sparse coding and then uses the dictionary to decompose each 
sample into more discriminative and less discriminative parts for subspace learning. For more subspace 
learning methods, see PiOl and the references therein. 

3. The Proposed Framework of Local Structure Preserving Sparse Subspace Learning 

In this section, we introduce our feature selection models that encourage global data fitting and also 
preserve local structure information of the data. The first model is of combinatorial nature, only allowing 
0-1 valued variables. The modeling idea is intuitive and inspired from (11) of M, but it is not easy to 
find a good approximate solution to the problem. The second model relaxes the first one and becomes 
its continuous counterpart. Various optimization methods can be utilized to determine its solution. More 
importantly, we find that the relaxed model can most times produce better performance than the original 
one; one can refer to the numerical results reported in Section]^ We want to emphasize again here that our 
main contributions concern the second model and the algorithm developed for it. 

3.1. A Generic Formulation 

Given n data samples {p/)'Li located in the cZ-dimensional space, the goal of feature selection is to find 
a small set of features that can capture most useful information of the data which can better serve to solve 
classification or clustering problems. One natural way to measure the information content is to see how 
close the original data samples are to the learned subspace spanned by the selected features. Mathematically, 
the distance of a vector x to a subspace X can be represented as ||x - where Vx denotes the 

projection onto X and || ■ II 2 is the Euclidean 2-norm. Hence, the feature selection problem can be described 
as follows 

mm^-\\X-XWH\\l 

s.t. We{0,l)''^ML^lrfxi =l.xi, 

l|Wl,xillo = 
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where X - [pi, P 2 , • • •, p«]^ e Concerning the proposed model, we make a few remarks: 

1. The matrix W is the selection matrix with entries of “0” or “1”. The constraint W^ldxi - l/rxi 
enforces that each column of W has only one “1”. Therefore, at most k features are selected. 

2. The constraint HfTl^xillo = « enforces that W has k nonzero rows. No feature will be selected more 
than once, and thus exactly k features will be chosen. 

3. Given W, the optimal H produces the coefficients of all d features projected onto the subspace 
spanned by the selected features. Hence, Q expresses the distance of X to the learned subspace. 

The recent work m mentions to use the 0-1 feature selection matrix, but it does not explicitly formulate 
an optimization model like 0. As shown in m, local structure of the data often contains discriminative 
information that is important for distinguishing different samples. To make the learned subspace preserve 
local structure, one can add a regularization term to the objective to promote such structural information, 
namely, to solve the regularized model 

min - XWHWl + fiLoc(W) 

W,H 2^ 

s.t. IT e{0,l)‘'^MT^ 1,^X1 =l.xi, ^2) 

l|Wl,xilk 

where Loc(W) is a local structure promoting regularization term, and ju is a parameter to balance the data 
fitting and regularization. In the next subsection, we introduce different forms of Loc(lT). 

3.2. Local Structure Preserving Methods 

Local structure of the data often contains important information that can be used to distinguish the 
samples 10311241 . A predictor utilizing local structure information can be much more efficient than that only 
using global information ||2T1 . Therefore, one may want the learned lower dimensional subspace to be able 
to preserve local structure of the training data. We briefly review two widely used local structure preserving 
methods. 
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3.2.1. Local Linear Embedding 


The Local Linear Embedding (LLE) BTII method first finds the set NmiPj) of m nearest neighbors for 
all j and then constructs the similarity matrix S as the (normalized) solution of the following problem 

n n 

mm 

'=1 >1 (3) 

s.t.Sij = 0, V; i Vi. 


One can regard S ij as the coefficient of the 7 “'^ sample when approximating the sample, and the co¬ 
efficient is zero if the 7 “'* sample is not the neighbor of the L'* one. After obtaining S from ([^, LLE 
further normalizes it such that - 1- Then it computes the lower-dimensional representation 

Y - e through solving the following problem 

n n 

Yj (4) 

/=! 7=1 

Note that if VT is a selection matrix defined as (j^, VT^py becomes a lower-dimensional sample, keeping the 
K selected features by W and removing all other features. Let L - (I - 5)^(7 - S), where I is the n x n 
identity matrix. Then it is easy to see that 0 can be equivalently expressed as 

min TriW^X'^LXW). (5) 

w 


3.2.2. Linear Preserve Projection 

For the Linear Preserve Projection ll25l (LPP) method, the similarity matrix S is generated by 


S 


U ~ 


exp( 


-20-2 1 




0 


P; e NmiPj) or P 7 6 7V„,(p,) 


otherwise 


( 6 ) 


where Nmipd is the set of m nearest neighbors of p,-. The LPP method requires the lower-dimensional 
representation to preserve the similarity of the original data and forms the transformation matrix W by 
solving the following optimization problem 

n 

“vj" Y ^ - W^PjWi (7) 

i,7=i 
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Let L - D - S he the Laplacian matrix, where D is a diagonal matrix, called degree matrix, with diagonal 


elements D„ = S ij, Vi. Then Q can be equivalently expressed as 


min Tr(W^X^LXW}. 
w 


( 8 ) 


3.3. Relaxed Formulation 

The problem Q is of combinatorial nature, and we do not have many choices to solve it. In the next 
section, we develop a greedy algorithm, which chooses k features one by one, with each selection decreasing 
the objective value the most among all the remaining features. Numerically, we observe that the greedy 
method can often make satisfactory performance. However, it can sometimes perform very bad; see results 
on Yale64 and Usps in section]^ For this reason, we seek an alternative way to select features by hrst 
relaxing (|^ to a continuous problem and then employing a reliable optimization method to solve the relaxed 
problem. As observed in our tests, the relaxed method can perform comparably well with and, most of the 
time, much better than the original one. 

As remarked at the end of Section [TT] any feasible solution W is nonnegative and has k non-zero rows. 
\f K ^ d (that is usually satisfied), then W has lots of zero rows. Based on these observations, we relax 
the 0-1 constraint to nonnegativity constraint and the hard constraints W^ldxi - l;txi JlVkl^xillfo = to 
g{W) < K, where giW) measures the row-sparsity of W. One choice of giW) is group Lasso l42l,i.e.. 


^(W) = ^||1T,||2, 


(9) 


1=1 


where VT, denotes the /-th row of W. This way, we relax (|^ to 


min -||Y - XWHWj + pLoc(W) 
W,H 2. 


( 10 ) 


s.t. IT e Mf g{W) < K, 


or equivalently 

1 2 

min -||Y - XWHWl + yuLoc(lT) -H/?g(W) 
2 

s.t. IT 6 


( 11 ) 


where denotes the set of dx K nonnegative matrices, and yS is a parameter corresponding to k. Note 
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that W now also serves as a transformation matrix of subspace learning, and K is the dimension of the 
learned subspace. It is not necessary K - k. For better approximation by subspace learning, we will choose 
K > K. We will focus on 0 because it is easier than ( [TOl i to solve. Practically, one needs to tune the 
parameters p, (3, k, and K. As shown in section]^ the model with a wide range of values of the parameters 
can give stably satisfactory performance. 

Our model is similar to the Matrix Factorization Feature Selection (MFFS) model proposed in ifThl . 
The difference is that the MFFS model restricts the matrix W to be orthogonal while we use regularization 
term g{W) to promote row-sparsity of W. Although orthogonal W makes their model closer to the original 
model ([1]), it increases difficulty of solving their problem. In addition, MFFS does not utilize local structure 
preserving term as we do and thus may lose some important local information. Numerical tests in section 
[^demonstrate that the proposed model along with an iterative method can produce better results than those 
obtained by using the MFFS method. 

Before completing this section, let us make some remarks on the relaxed model. Originally, W is 
restricted to have exactly k non-zeros, so it could be extremely sparse as k d, and one may consider 
to include a sparsity-promoting term (e.g., ^i-norm) in the objective of O- However, doing so is not 
necessary since both giW) and the nonnegativity constraint encourage sparsity of W, and numerically we 
notice that W output by our algorithm is indeed very sparse. Another point worth mentioning is that the 
elements of W given by are real numbers and do not automatically select k features. For the purpose 
of feature selection, after obtaining a solution W, we choose the features corresponding to the k rows of W 
that have the largest norms because larger values imply more important roles played by the features. 


3.4. Extensions 

In ( [TT| ), Frobenius norm is used to measure the data fitting and typically suitable when Gaussian noise 
is assumed in the data and also commonly used if no priori information is assumed. One can of course use 
other norm or metric if different priori information is known. For instance, if the data come with outliers, 
one can employ the Cauchy Regression (CR) Il43l instead of the Frobenius norm to improve robustness and 
modify GD read as 


d n 

min > > In 1 

W,H ^ ^ 
j=\ 1=1 


H- 




+ luLociW) + f3g(W) 


s.t. W>0. 


( 12 ) 
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When the data involves heavy tailed noise, ll44l |45]| suggest to use the Manhattan distance defined by 
IIAIIm = ZU 27=1 and this way, ^ can be modified to 


min ||X - XWHWm + fiLoc(W) + (3g(W) 


(13) 


s.t. W>0. 


4. Solving the Proposed Sparse Subspace Learning 

In this section, we present algorithms to approximately solve (|^ and ( [TT] i. Throughout the rest of the 
paper, we assume that Loc(W) takes the function either as or and g{W) is given by (|^. Due to the 
combinatorial nature of (|^, we propose a greedy method to solve it. The problem is smooth, and 
various optimization methods can be applied. Although its objective is nonconvex jointly with respect to 
W and H, it is convex with regard to one of them while the other one is fixed. Based on this property, we 
choose the block coordinate descent method to solve O- 

4.1. Greedy Strategy for 

In this subsection, a greedy algorithm is developed for selecting k out of d features based on Q. The 
idea is as follows: each time, we select one from the remaining unselected features such that the objective 
value is decreased the most. We begin the design of the algorithm by making the following observation. 

Observation 1. Let I\ and I 2 be two index sets of features. Assume ly c 1 2 , and and Xj^ are 
submatrices ofX with columns indexed by I\ and 1 2 respectively. Then 


min ||X-Xjji/ll|7 > min \\X-Xi^H 2 \\\. 

H\ H2 


(14) 


From the above observation, if the current index set of selected features is J, the data fitting will become 
no worse if we enlarge I by adding more features. Below we describe in details on how to choose such 
additional features. Assume X is normalized such that 


||x ,||2 = l ,7 = l,...,ti. 


(15) 


where Xj denotes the column of X. Let I be the current index set of selected features. The optimal H to 
miu/f IIX - XjH\\f is given by 


H* = {XjXjfXjX, 


(16) 
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where ’ denotes the Moore-Penrose pseudoinverse of a matrix. Now consider to add one more feature into 
I, say the /* one. Then the lowest data fitting error is 


min ||X - XjH* - Xjh||^ 

h 

= mm\\h\\l-2{h,x]{X-XjH*)) + \\X-XiH*\\l 

h ^ 

= - ||xJ(X - XjH*)\\l + ||X - XjH*\\l, 


where the last equality is achieved at h = xj (X-XjH*). Hence, we can choose j such that ||xT {X—XiH*y\\2 
is the largest among all features not in I. 

Carrying out a comparison to ||xJ(X - XiH*)\\2, we find that ||xJ(X - can serve better. It turns 

out that the latter is exactly the correlation between Xj and the residual X - XjH*. Denote the correlation 
between x, and X as 

d 

Cor{Xi,X) = |x,^Xj|. 

.5=1 

As shown in ll46ll . if Cor(Xi,X) is large, then the columns of X can be better linearly represented by x,. To 
preserve local structure, we need also incorporate Loc(W)- If the set of selected features is I, then 


Loc(VT) = Tr(W~^X~^LXW) = ^x;"LXi. 


Assuming L - D - S, i.e., using the LPP method in section 3.2.2 (that is used throughout our tests), we 
have from ([T5|) that 


minxlLx; o maxxy^x,-. 

jii J jii J 


Therefore, we can enlarge I by adding one more feature index j* such that 


f 6 argmax Cor(Xj, X - XjH*) +x^SXj, 
iii 

where H* is given in ( [Th] ), and we have set // = 1 in (|^ for simplicity. Algorithm[2summarizes our greedy 
method, and for better balancing the correlation and local structure preserving terms, we normalize both of 
them in the 5th line of Algorithmic 
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Algorithm 1 Greedy Locally Preserved Subspace Learning (GLPSL) 


1 

2 

3 

4 

5 

6 

7 

8 


Input: Data matrix X e and the number k of features to be selected. 
Output: Index set of selected features I c . ,d] with \I\ - k. 
Initialize residual R - X, candidate set Q = { 1 ,2,..., li), selected set I 
for i - 1 to A- do 


i <— argmax/en 


Cor{Xi,R) 

XjenCor(xj.R) 


+ 


xJSXi 


Q <— Q\{i) and J = J U {i). 


R^X- XjiXjXjyxJX. 

end for 


4.2. Accelerated block coordinate update method for o 

In this subsection, we present an alternative method for feature selection based on Utilizing bi¬ 
convexity of the objective, we employ the accelerated block coordinate update (ECU) method proposed 
in m to solve O- As explained in lfT9l , ECU especially fits to solving bi-conve)([^ optimization prob¬ 
lems like ([TT|i. It owns low iteration-complexity as shown in section 4.3 and also guarantees the whole 


iterate sequence convergence on solving o as shown in section [4^ The whole iterate sequence conver¬ 
gence is important because otherwise running the algorithm for different numbers of iterations may result 
in significantly different solutions, which will further affect the clustering or classfication results. Many 
existing methods such as the multiplicative rule method only guarantee nonincreasing monotonicity of 
the objective values or iterate subsequence convergence, and thus our convergence result is much stronger. 

Following the framework of ECU, our algorithm is derived by alternatingly updating W and H, one at 
a time while the other one is fixed at its most recent value. Specifically, let 


f{W,H) = ^\\X-XWH\\l + ^TriW^X^LXW), 

8p(W)^mW\\2,v 


(17) 

( 18 ) 


At the k-th iteration, we perform the following updates: 


= argmin<Vv^/(Vk*, H% W-Wy+ -^\\W - W% + gpW) 


■rk rjk 




-k\\2 


W>0 


= argmin/(lU*+',//), 


(19a) 

(19b) 


^More precisely, in E), BCU is proposed to solve multi-convex optimization problems, which includes bi-convex problems as 
special cases. 
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where we take Lf^, as the Lipschitz constant of Viv/(Vk, with respect to W and 


W’^ + oJkiW^ - ( 20 ) 

is an extrapolated point with weight w*. € [0,1], Vk. 

Note that the //-subproblem ( |19b| l can be simply reduced to a linear equation and has the closed-form 
solution: 

^A:+I ^ (21) 

If H is restricted to be nonnegative, in general, ( |19b| i does not exhibit a closed-form solution. In this case, 
one can update H in the same manner as that of W, i.e., completing a block proximal-linearization update. 
In the following, we discuss in details on parameter settings and how to solve Vk-subproblem ( [T9al i. 

4.2.1. Parameter settings 

By direct computation, it is not difficult to have 

Vwf{W, H) = X^iXWH - X)H^ + nX'^LXW. (22) 


For any W, W, we have 

\\Xwf(W,H)-Xwf(W,H)\\F 

= WX'^iXWH - X)H~^ + fiX'^LXW - X'^iXWH - X)H~^ - giX'^ LXW\\f 

< WX'^iXWH - X)H'^ - X^iXWH - X)H^\\f + WfiX^LXW - ^X^LXWWf 
= \\X^X{W - W)HH^\\f + fi\\X^LX{W - W)||f 

< {WX^XMlHH^h + pWX^LX h) IIW - Wllf, 

where ||A ||2 denotes the spectral norm and equals the largest singular value of A, the first inequality follows 
from the triangle inequality, and the last inequality is from the fact ||AB||f < ||A|| 2 ||B||f for any matrices A 
and B of appropriate sizes. Hence, ||2f^2f||2||////^||2 + ^i\\X'^LX \\2 is a Lipschitz constant of Vvr/(Vk, H) with 
respect to W, and in (|19a[), we set 


Li = m^H’^fhWX^Xlk+m^LXh. 


(23) 
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As suggested in d, we set the extrapolation weight as 


tOk - min ibk, 601 


( 



( 24 ) 


where < 1 is predetermined and (bk - * with 



(l + -y/l +4f2_jj 


The weight Wk has been used to accelerate proximal gradient method for convex optimization problem (cf. 
|[47l '). It is demonstrated in Il48ll49l that the extrapolation weight in ( |24l ) can significantly accelerate BCU 
for nonconvex problems. 

Algorithm 2 Proximal operator for nonnegative group Lasso: W - Prox-NGL(F, d) 
for i - 1,.. .,ddo 

Let y be the row of Y and I the index set of positive components of y 
Set w to a zero vector 


if llyjib > 4 then 
Let wj = (||yj|| 2 -/l)ji^ 


end if 

Set the i'* row of IT to w 

end for 


4.2.2. Solution of W-subproblem 

Note that (|19a[) can be equivalently written as 



which can be decomposed into d smaller independent problems, each one involving one row of W and 
coming in the form 



(25) 


We show that ( |25] l has a closed-form solution and thus ( |19a| l can be solved explicitly. 

Theorem 1. Given y, let I — [i : y, > 0) be the index set of positive components ofy. Then the solution x* 
of ( |25| l is given as follows 

1 . For any i i I, x* - 0; 
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2. ^Ilyjlb < then x} = 0; otherwise, = (Hyjlb - ^)|j^- 


Proof. For i i I, we must have x* - 0 because if x* > 0, setting the i'^ component to zero and keeping 
all others unchanged will simultaneously decrease (x, - y,)^ and ||x|| 2 . Hence, we can reduce ( |25| ) to the 
following form 

min^||xj-yj||2 + d||xj||2. (26) 

xj>0 1 

Without nonnegativity constraint on xj, the minimizer of ( |26] l is given by item 2 of Theorem[T](for example, 
see ifSOll l. Note that the given x^ is nonnegative. Hence, it solves ( |26l l, and this completes the proof. □ 

The above proof gives a way to find the solution of ( [25] l. Using this method, we can explicitly form the 
solution of ( |19a[ ) by the subroutine Prox-NGL in Algorithm]^ where Y e and T > 0 are inputs, and W 
is the output. Arranging the above discusstion together, we have the pseudocode in Algorithm]^ for solving 

O- 


Algorithm 3 Global and Local Structure Preserving Sparse Subspace Learning (GLoSS) 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

It 

12 

13 

14 

15 

16 


Input: Data matrix X 6 the number of selected features k and parameter /?, ji. 
Output: Index set of selected features I c {!,...,£/) with \I\ - k 
Initialize W® e e choose a positive number < 1; set k = 1. 

while Not convergent do 

Compute Lf,, and ut according to (23 1 and (24 1 respectively. 

Let = W* + ojkiW'^ - 


Update W‘ 


k+\ 


Prox-NGL(W* - ^). 


Update ^ (j^. 
if //*+') > F(W^ H'^) then 

Set W* = WK 


else 

Let k «— k + 1. 

end if 
end while 

Normalize each column of W. 

Sort ||VT,.|| 2 , i - 1,.. .,d and select features corresponding to the k largest ones. 


4.3. Complexity Analysis 

In this section, we count the flops per iteration of Algorithm]^ Our analysis is for general case, namely, 
we do not assume any structure of X. Note that if X is sparse, the computational complexity will be lower. 
The main cost of our algorithm is in the update of W and H, i.e., the 7'^’ and 8'^ lines in Algorithm 
For updating W, the major cost is in the computation of Xwf(W,H). Assume the dimension of subspace 
K < min((i, n). Then from (|22ll, we can obtain the partial gradient by first computing XW, HH^ and XH^ , 
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then XW(HH~^) and fiL(XW), and finally left multiplying X~^ to XW(HH~^) - XH~^ + jiLiXW). This way, 
it takes about 3ndK + dK^ + tiK^ + rP'K flops. To update H by ( |2T] ), we can use the same trick and obtain 
H in about 2ndK + tiK^ + dK^ + flops. Note that with XW and LX'W pre-computed, the objective 
value required in 9'* line can be easily obtained in about nd flops. Therefore, we have the per-iteration 
computational complexity of order OindK + rP'K) since K < min((i, n), and if K - Oil), then the algorithm 
is scalable to data size. 


4.4. Convergence analysis 

In this section, we analyze the convergence of Algorithm GLoSS. Let us denote 


L(W) 


0 , if IT > 0, 
+ 00 , otherwise 


to be the indicator function of the nonnegative quadrant. Also, let us denote 


FiW, H) = fiW, H) + gpiW) + L^iW)- 


Then the problem GD is equivalent to 

minFCIT,//), 

W,H 

and the first-order optimality condition is 0 e dFiW,H). Here, dF denotes the subdifferential of F (see 
ED for example) and equals XF if F is differentiable and a set otherwise. By Proposition 2.1 of E2l . 
0 e dF(W, H) is equivalent to 

0 e dwFiW, H), and 0 = XuFiW, H) 


namely. 


0 6 XwfiW, H) + dgniW) + 5t+(lT), 

(27a) 

0 = XHfiW,H). 

(27b) 


We call (IT, H) a critical point of ( [TT] l if it satisfies ( |27| l. 

In the following, we first establish a subsequence convergence result, stating that any limit point of the 
iterates is a critical point. Assuming existence of a full rank limit point, we further show that the whole 
iterate sequence converges to a critical point. The proofs of both results involve many technical details and 
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thus are deferred to the appendix for the readers’ convenience. 


Theorem 2 (Iterate subsequence convergence). Let {(VK^, be the sequence generated from Algo¬ 
rithm^ Any finite limit point of{(W^, is a critical point of ( |ll| l. 

Due to the coercivity of g{W) and the nonincreasing monotonicity of the objective value, [W^] must be 
bounded. However, in general, we cannot guarantee the boundedness of {//*) because may be rank- 
degenerate (i.e., not full rank). As shown in the next theorem, if we have rank-nondegeneracy of XW^ in 
the limit, a stronger convergence result can be established. The nondegeneracy assumption is similar to that 
assumed in ll53l section 7.3.2] and m for (higher-order) orthogonal iteration methods. 

Theorem 3 (Whole iterate sequence convergence). Let {(W^, be the sequence generated from 

Algorithm If there is a finite limit point {W, H) such that XW is full-rank, then the whole sequence 
{(W*,i/*))“ j must converge to (W,H). 

5. Experimental Studies 

In this section, the proposed methods GLPSL (Algorithm[T]) and GLoSS (Algorithm]^ are tested on six 
benchmark datasets and compared to one widely used subspace learning method PCA and seven state-of- 
the-art unsupervised feature selection methods. 


5.7. Datasets 

The six benchmark datasets we use come from different areas, and their characteristics are listed in 
Table 1^ Yale64, WarpPIE, Orl64 and Orlraw^are face images, each sample of the datasets representing a 
face image. Usps^ is a handwritten digit dataset that contains 9,298 handwritten digit images. Isolet^ is a 
speech signal dataset containing 30 speakers’ speech signal of alphabet twice. All datasets are normalized 
such that the vector corresponding to each feature has unit 72-norm. 


Table 2: The datasets detail 


Dataset 

It Instances 

It Features 

It Classes 

Type of Data 

Yale64 

165 

4096 

15 

Face image 

WarpPIE 

210 

2420 

10 

Face image 

Orl64 

400 

4096 

50 

Face image 

Orlraws 

too 

10304 

10 

Face image 

Usps 

9298 

256 

10 

Digit image 

Isolet 

1560 

617 

26 

Speech signal 


^http://featureselection.asu.edu/datasets.php 
^httpV/www.cad.zju.edu.cn/home/dengcai/Data/data.html 
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5.2. Experimental Settings 

Our algorithms are compared to the following methods: 

1. PCA: Principal component analysis (PCA) ||71 is an unsupervised subspace learning method that 
maximizes global structure information of the data in the principal space. 

2. LS: Laplacian score (LS) method Il24l uses the Laplacian score to evaluate effectiveness of the fea¬ 
tures. It selects the features individually that retain the samples’ local similarity specified by a simi¬ 
larity matrix. 

3. MCFS: Multi-cluster feature selection (MCFS) ifTSll is a two-step method, and it formulates the fea¬ 
ture selection process as a spectral information regression problem with ^i-norm regularization term. 

4. UDFS: Unsupervised discriminative feature selection (UDFS) method ll55l combines the data’s local 
discriminative property and the ^ 2 , 1 -norm sparse constraint in one convex model to select the features 
which have the highest power of local discriminative property. 

5. RSR: Regularized self-representation (RSR) feature selection method ll5^ uses the ^ 2 , 1 -norm to mea¬ 
sure the fitting error and also ^ 2 , 1 -norm to promote sparsity. Specifically, it solves the following 
problem: 

min||X-W|| 2 ,i +y 8 ||lU|| 2 ,i. 
w 

6 . NDFS: Nonnegative Discriminative Feature Selection (NDFS) method ifTSll utilizes the nonnegative 
spectral analysis with ^ 2 , 1 -norm regularization term. 

7. GFSPFS: Global and local structure preservation for feature selection (GLSPFS) method m uses 
both global and local similarity structure to model the feature selection problem. It solves the follow¬ 
ing problem: 

min ||y - XW\\l + pTr(W^X^LXW) + 

W 

8 . MFFS: Matrix factorization feature selection (MFFS) method M is similar to ours. It performs the 
subspace learning and feature selection process simultaneously by enforcing a nonnegative orthogo¬ 
nal transformation matrix W. This solves the following problem: 

minlwx - XWHWl 

W.H 2 |,28) 

s.t. W>Q, H>Q, W^W 

There are some parameters we need to set in advance. The dimension of the subspace is fixed to /T = 100 
for GLoSS method, and the number of selected features k is taken from {20,30,40,50,60,70,80,90,100) 
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for all datasets. We use the LPP method in section 3.2.2 to preserve local structure of the data in GLSPFS, 
NDFS, GLPSL and GLoSS because both MCFS and LS use the LPP Laplacian graph, and we set the 
number of nearest neighbors to m - 5 for LS, MCFS, UDFS, GLSPFS, NDFS, GLPSL and GLoSS. The 
parameter m is required by LS, MCFS, GLSPFS, NDFS, GLPSL and GLoSS to build a similarity matrix 
and UDFS to build the local total scatter and between-class scatter matrices. For simplicity, the parameter 


of local structure preserving term is fixed to p = 1 in GLSPFS and GLoSS for all the tests in Sections 5.3.1 


and 5.3.2 We study the sensitivity of GLoSS to p in Section 5.3.3 The sparsity parameter for UDFS, 
RSR, GLSPFS, NDFS and GLoSS is tuned from {0.01,0.1,1,10,40,70,100). After completing the feature 
selection process, we use the /T-means algorithm to cluster the samples using the selected features. The 
number of iterations of UDFS, GLSPFS, NDFS, MFFS, and GLoSS are set to 30. Because the performance 
of /f-means depends on the initial point, we run it 20 times with different random starting points and report 
the average value. 

The compared algorithms are evaluated based on their clustering results. For each sample of all datasets, 
we set its class number as the cluster number. To measure the clustering performance, we use clustering 
accuracy (ACC) and normalized mutual information (NMI), which are defined below. Let p, and q, be the 
predicted and true labels of the sample, respectively. The ACC is computed as 


ACC = 


2'Ll 6{qi,map{pi)) 


(29) 


where 6{a, b) - I if a - b and 6(a, b) — 0 otherwise, and map{-) is a permutation mapping that maps each 
predicted label to the equivalent true label. We use the Kuhn-Munkres algorithm ll57l to realize such a 
mapping. High value of ACC indicates the predicted labels are close to the true ones, and thus the higher 
ACC is, the better the clustering result is. The NMI is used to measure the similarity of two clustering 
results. For two label vectors P and Q, it is defined as 


NMI{P, Q) = 


KP,Q) 

^JmP)H{Qy 


(30) 


where I{P, Q) is the mutual information of P and Q, H{P) and H(Q) are the entropies of P and Q ll58ll . In 
our experiments, P contains the clustering labels using the selected features and Q the true labels of samples 
in the dataset. Higher value of NMI{P, Q) implies that P better predicts Q. 
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5.3. Experimental results 

In this subsection, we report the results of all tested methods. In addition, we study the sensitivity of the 
parameters present in O- 

5.3.1. Performance comparison 

In Tables and we present the ACC and NMI values produced by different methods. For each 
method, we vary the number of selected features among {20,30,40,..., 100) and report the best result. 
From the tables, we see that GLoSS performs the best among all the compared methods except for Yale64 
and WarpPIE in Table|^and Yale64 and Orl64 in Table|^ for each of which GLoSS is the second best. In 
addition, we see that the greedy method GLPSL performs reasonably well in many cases but can be very 
bad in some cases such as Usps in both Tables, and this justihes our reason to relax (|^ and develop GLoSS 
method. Linally, we see that GLoSS outperforms MLLS for all datasets, and this is possibly due to the local 
structure preserving term used in GLoSS. 


Table 3: Clustering results (ACC% ± std%) of different feature selection algorithms on different datasets. The best results are high¬ 
lighted in bold and the second best results are underlined. (The higher ACC is, the better the result is.) 


Dateset 

Isolet 

Yale64 

Orl64 

WarpPIE 

Usps 

Orlraw 

PCA 

47.90 ± 2.97 

32.79 ± 3.22 

33.75 ± 1.58 

39.95 ±4.37 

59.90 ± 3.89 

48.20 ± 3.68 

LS 

55.14 ± 3.15 

41.25 ± 3.28 

41.75 ± 1.71 

32.33 ± 1.03 

59.79 ± 2.72 

66.12 ±6.82 

MCFS 

54.95 ± 3.28 

44.88 ± 3.72 

50.75 ± 1.25 

50.38 ± 2.25 

66.55 ±3.11 

77.43 ± 7.15 

UDFS 

29.60 ± 2.73 

38.21 ± 3.02 

40.78 ± 1.03 

55.57 ± 2.92 

50.59 ± 1.97 

65.32 ±6.18 

RSR 

49.88 ± 3.75 

45.48 ± 3.34 

53.24 ± 1.83 

37.52 ± 2.23 

62.54 ± 2.34 

72.54 ± 6.52 

NDFS 

54.33 ± 3.73 

45.79 ±3.81 

49.85 ± 1.69 

34.10 ±3.81 

63.32 ± 3.35 

67.80 ± 6.48 

GLSPFS 

54.09 ± 3.22 

50.84 ± 5.34 

53.63 ± 2.62 

45.94 ± 2.38 

64.65 ± 3.69 

78.00 ± 7.47 

MFFS 

55.39 ± 3.32 

49.09 ± 3.64 

50.19 ± 1.64 

36.57 ± 2.32 

63.30 ±3.36 

73.55 ± 7.68 

GLPSL 

49.05 ± 3.02 

53.97 ± 3.45 

41.72 ± 1.05 

47.52 ± 1.87 

51.91 ± 2.18 

72.16 ±7.03 

GLoSS 

62.45 ± 3.58 

53.45 ± 3.88 

54.27 ± 1.87 

52.76 ±2.12 

67.24 ± 3.27 

79.37 ± 7.34 


5.3.2. Compare the performance with all features 

To illustrate the effect of feature selection to clustering, we compare the clustering results using all 
features and selected features given by different methods. Ligure[T] plots the ACC value and Ligurej^the 
NMI value with respect to the number of selected features. The baseline corresponds to the results using all 
features. Lrom the figures, we see that in most cases, the proposed GLoSS method gives the best results, and 
selecting reasonably many features (but far less than the total number of features), it can give comparable 
and even better clustering results than those by using all features. Hence, the feature selection eliminates 
the redundancy of the data for clustering purpose. In addition, note that using fewer features can save the 
clustering time of the /T-means method, and thus feature selection can improve both clustering accuracy 
and efficiency. 
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Table 4: Clustering results (NMI% ± std%) of different feature selection algorithms on different datasets. The best results are high¬ 
lighted in bold and the second best results are underlined. (The higher NMI, the better result is.) 


Dateset 

Isolet 

Yale64 

Orl64 

WarpPIE 

Usps 

Orlraw 

PCA 

61.48 ± 1.20 

41.43 ± 2.72 

58.57 ±0.86 

42.83 ± 3.82 

56.08 ± 1.54 

57.30 ± 3.93 

LS 

69.73 ± 1.43 

46.88 ± 2.07 

62.61 ± 1.53 

30.06 ± 2.89 

56.62 ± 0.95 

73.38 ±3.12 

MCFS 

69.82 ± 1.37 

53.70 ± 1.58 

69.33 ± 1.62 

54.37 ± 4.95 

61.01 ±0.92 

83.91 ± 3.53 

UDFS 

44.98 ± 1.02 

47.40 ± 1.64 

62.38± 1.41 

54.55 ± 4.38 

41.31 ± 1.21 

68.78 ± 3.45 

RSR 

63.47 ± 1.10 

56.08 ± 1.43 

72.33 ± 1.75 

41.81 ± 3.75 

55.32 ± 1.52 

83.96 ± 4.35 

NDFS 

70.05 ± 2.00 

54.67 ± 2.35 

70.42 ± 1.14 

28.16 ±4.45 

58.78 ± 0.99 

78.81 ± 3.99 

GLSPFS 

68.80 ± 1.07 

56.18 ± 3.40 

73.05 ± 1.52 

52.23 ± 4.42 

60.33 ± 1.65 

82.99 ± 4.73 

MFFS 

72.64 ± 1.73 

56.17 ±4.47 

70.65 ± 1.25 

40.95 ± 3.39 

59.11 ±0.76 

81.09 ±4.12 

GLPSL 

65.41 ± 1.23 

61.39 ± 1.72 

64.76 ± 1.50 

53.33 ± 3.89 

40.98 ± 0.87 

72.97 ± 3.37 

GLoSS 

74.28 ± 1.25 

58.87 ± 1.65 

73.02 ± 2.02 

55.76 ± 4.56 

61.29 ± 1.25 

85.65 ± 4.15 





Figure 1: The clustering accuracy (ACC) of using all features and selected features by different methods. 


5.3.3. Sensitivity of parameters 

To further demonstrate the performance of the proposed GLoSS method, we study its sensitivity with 
regard to the parameters K,p and /3 in ( [TT] i. First, we fix /r = 1 and vary k and (3. Figures]^ andplot the 
ACC and NMI values given by GLoSS for different k and jS’s. From the figures, we see that except for 
Isolet, GLoSS performs stably well for different combinations of k and /?, and thus the users can choose the 
parameters within a large interval to have satisfactory clustering performance. Secondly, we fix yS = 1 and 
vary k and p. Figures]^ and |^plot the ACC and NMI values given by GLoSS for different k and p’s. Again, 
we see that GLoSS performs stably well except for the Isolet dataset. 
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Figure 2: The normalized mutual information (NMI) of using all features and selected features by different methods. 




(a) Isolet 



(c) Orl64 




(e) Usps 


(d) WarpPIE 

Figure 3: Clustering accuracy (ACC) produced by GLoSS with different k and/?. 
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(a) Isolet 



(b) Yale64 



(c) Orl64 




(e) Usps 


(f) Oriraws 


Figure 4: Normalized mutual information (NMI) produced by GLoSS with different k and /?. 
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Figure 5: The clustering accuracy (ACC) given by GLoSS with different k and fi. 
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(d) WarpPIE 



# Feature o 


(e) Usps 



(f) Oriraws 


Figure 6: The normalized mutual information (NMI) given by GLoSS with different k and /i. 


6. Conclusions 

We have proposed a new unsupervised joint model on subspace learning and feature selection. The 
model preserves both global and local structure of the data, and it is derived by relaxing an existing com¬ 
binatorial model with 0-1 variables. A greedy algorithm has been developed, for the first time, to solve the 
combinatorial problem, and an accelerated block coordinate descent (BCD) method was applied to solve 
the relaxed continuous probelm. We have established the whole iterate sequence convergence of the BCD 
method. Extensive numerical tests on real-world data demonstrated that the proposed method outperformed 
several state-of-the-art unsupervised feature selection methods. 
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Appendix A. Proof of Theorem 

For simplicity, we assume at;. = 0, i.e., there is no extrapolation. The case of ^ 0 is more 

complicated but can be treated similarly with more care taken to handle details; see ||T9|| for example. 

The following result is well-known (c.f. Lemma 2.1 of lfT9ll ) 

F{W\h'‘) - - W% > y - W%, (A.l) 

where 

= ^i\\X^LX\\2 > 0. (A.2) 

By Lemma 3.1 of ll5^ . we have 

]^\\X - H% - i||A-(A.3) 

and 

- X), (A.4) 

where contains the left r leading singular vectors of XW^^^ and r is the rank of . 

Note that 


F(W'‘*\h’^) - F(W''^\h’^*'^) = ]^\\X-XW’^*''H%- i||A- 


F‘ 


Hence, summing ( |A.l[ t and ( |A.3| t over k and noting nonnegativity of F we obtain 


z 

k={) 


y||lL'=+' -W%+ I < 


and thus 

lim = 0. 

k^oo 

and 

lim -X)^ lim = 0. 

k—*oc k—*oc 


(A.5) 


(A.6) 
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Combining the two equalities in (|A.6|l, we have 


lim - A) = 0. 

k-^oo 


Since {XW'^} is bounded and {XWY^ - (XW’^Y, left multiplying {XW'^Y in the above equation 
gives 

\utl{XWY'"{XW''h'‘ - X) = 0. (A.7) 

k—*oo 

Assume {W,H) is a finite limit point of {{W'^, HY}Yi- Then there exists a subsequence HY}keK 
convergent to (W, H). If necessary, taking another subsequence, we can assume L* ^ L for some L > 0 as 
'K 3 k ^ oo. From ( |A.7[ l, it holds that 


XnfiW, H) = {XWYiXWH - A) = 0. 


In addition, from the update rule of W, we have 




l} 

argmin<V^/(lF^ H\ W-WY+ ^I|1T - W% + g^W). 
iv>o ^ 


Letting TC 3 k —» oo in the above equation and using (|A.5|l yield 


IF = argmin<Viv/(lT, H),W-W)+ YW- ■ 
iv>o 2 




which implies 


0 e Vwf(W,H) + dgfiiW) + diYW) = dwF(W,H). 


Therefore, (W, H) is a critical point of GD- 


Appendix B. Proof of Theorem 

For simplicity of notation, we let = (VT^, //*) and Z = (IT, H). In addition, we assume - 0, Vk 
as in the proof of Theorem]^ Again, the case of ^ 0 can be shown similarly. Let crnjin(AlT) > 0 be the 
smallest singular value of XW. By the continuity of singular value function and spectral norm of a matrix. 
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there exists 5 > 0 such that 


cr • (XW) 

cTn^iniXW) > and lIXiVlb < 2 IIWII 2 , 'iW € mW,6), (B.la) 

WHH^h < 2\\HH^\\2, V// e S(H,6), (B.lb) 

where crjni„(A) denotes the smallest singular value of matrix A, and S(A, S) {A ; ||A - A||f < d). 

Since F is a semi-algebraic function and continuous in its domain, it exhibits the so-called Kurdyka- 
Lojasiewicz property (c. f. EOl): in a neighborhood !B(Z,p), there exists (f>(s) = cs' ® for some c > 0 and 
0 < < 1 such that 


4>'(\F(Z) - F(Z)|)dist(0, dF(Z)) > 1, for any Z e S(Z,p) n dom(F) and F{Z) + F(Z). (B.2) 


Let 


Fk = F{Z’^) - FiZ), and 0, = 


Without loss of generality, we assume Z‘^ is sufficiently close to Z such that 


2||Z“-Z||f+ 3 

where is defined in (|A.2|l, and 


2Fo 


8F0 


0 - 2 . (XW) 

min' '' 


Cl ^ 


Cl =L5 + 2||////^||2||ZZ^||2+L;„ 

= y ^-8-■ 


In the above equation, Lg is the Lipschitz constant of Vwf(W, H) in S(Z, <5), i.e.. 


||Vw/(Z) - Vwf{Z)\\F < Ls\\Z - Z||f, VZ, Z 6 S(Z, 6). 


(B.3) 


(B.4) 

(B.5) 


(B.6) 


Note that Lg must be finite since f{W, H) is twice continuous differentiable and S(Z, 5) is bounded. Other¬ 
wise if does not hold, since Z is a limit point of {Z^), we can take an iterate Z^" sufficiently close to Z 
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and let be the new starting point. If neccessary, taking a smaller p, we assume 


(B.7) 

where 6 is the quantity in ( |B.l| i. 

From ( |A.l| l and < Fi^ < F{Z), 'ik, we have ||IF* - and thus 

\'2F 

||IF‘ - TFIIf < IIIF' - W°||f + ||VF“ - W\\f < ||VF° - W\\f + <p<6. (B.8) 

Hence, cr^i^{XW^) > from ( |B.la[ ), and 

F{W\tf)-F{W\H'') > )] ^^^1 _^0||2 > [o^nin(^B0]_||^i _ ^0||2 ^ 

2 8 

which implies ||//' - Therefore, 

II//1 - H\\f < l|H‘ - H^f + l|H“ - H\\f < l|H“ - H\\f + J- -(B.9) 

Combining ( |B.8| ) and ( |B.9| l, we have 

||Z‘ - ZIIf < HIT* - W\\f + ||//' - H\\f < 2\\Z° - Z||f + 
which together with ( |B.3| l implies Z' 6 !B(Z,p). 

Assume that for some integer K, Z^ e !B{Z,p), VO < k < K. We go to show Z^^' 6 !B(Z,p) and thus by 
induction Z^ 6 !B(Z,p), Vk. Note that 

0 6 + Li7\w'‘ - + dg/jiW'^) + dL+{W^), 

0 = 


Hence, 


dist(0,5F(Z')) <||V^/(W\H^) - + Li:\\W’^ - W’^'^Wf 

<Ci||Z^-Z'^-i||f, (B.IO) 
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where Ci is defined in (|B.4|i. In addition, we have 


0/t - 0yt+l 

>4> (Fk){Fk — Fk+\) ( from concavity of (p) 

^— ( from KL property (IB.21)) 

^ C 2 || Z ^+'- Z '=||2 

-Ci||Z^-Z^-i||f’ 


where the last inequality follows from (|B.5|l, (|A.l|l and 


F{W'^^\h’^)- 


F(W‘ 


■k+\ 




O’! 






■k+ln2 


Transforming (|B.l l[l gives 


CallZ'^^' - Z% < CilIZ' - Z'^-%{cp, - cPk^y) 

^ V^IIZ*^^' - Z% < ^Ci||Z^-Z^-i||f(0i-<^^+i) 

^ V^||Z'+' - Z% < ^||Z' - Z’^-^ + - (pk^i). 

2. 2 VC 2 


Summing the above inequality over k and arranging terms give 


Hence, 


y \\Z^^^ - Z'^llf < HZ' - Z°||f + ^{cPi - 0^+1). 

;,_1 ^^2 




liz^'^' - ziif < ^ Iiz'^^' - z'lif + HZ' - ziif 


A:=l 

^■1 


<HZ‘-ZHf+ Hz'-z'>Hf+ ;^(Ao 

ZC 2 

<2HZ'-Z«Hf+ HZ'’-Z|If+:^0o 

ZC 2 


(from ( |B.8[ l and ( |B.9[ l) < p. 


(B.ll) 


(B.12) 


(B.13) 


which indicates 6 !B{Z,p). By induction, we have Z*^ 6 !B(Z,p), 'ik, and thus ( |B.12| l holds for all 
K. Therefore, is a Cauchy sequence and converges. Since Z is a limit point, it must hold that 
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limt^oo = Z. This completes the proof. 
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